Skip to content

Fix revisit content-type#42

Merged
lfoppiano merged 8 commits into
ccfrom
bugfix/revisit-content-type
Mar 28, 2026
Merged

Fix revisit content-type#42
lfoppiano merged 8 commits into
ccfrom
bugfix/revisit-content-type

Conversation

@lfoppiano

Copy link
Copy Markdown

This PR fixed issue #40. I replaced the content type message/http with application/http; msgtype=response.

@sebastian-nagel sebastian-nagel left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @lfoppiano, thanks! The fix looks good to me.

However, the unit test failed when I tried to run it. Creating a "HTTP 304" response object, resp. the included Content object, might be challenging. See the attached data. It's on you whether to read the response from the segment, or to study it to figure out how it should look like. Use nutch readseg for inspection. There are other Nutch unit tests which use segments.

Otherwise, it's very appreciated that there will be now the first "deeper" unit test for the WARC writer.

Comment thread src/test/org/commoncrawl/util/TestWarcWriter.java Outdated
@lfoppiano

lfoppiano commented Feb 24, 2026

Copy link
Copy Markdown
Author

@sebastian-nagel the test was left failing on purpose, because I wanted to first fix the test running.

I managed (or, better, Claude Opus managed - other models weren't able) to find the cause.
In short, all test were ran with <fork> and there were two tests in particular TestCommonCrawlDataDumper whcih called at some point System.exit() dumping the whole forked process. However also TestMimeUtil had a simliar problem.

For now I've separated the org.commoncrawl.* from the org.apache.* tests. But the issue may still occur preventing other tests from the same group to be executed.

@lfoppiano lfoppiano marked this pull request as draft February 25, 2026 07:37
@sebastian-nagel

Copy link
Copy Markdown

called at some point System.exit()

Thanks! That's a left-over of NUTCH-2852. Unclear why it hits our fork but not upstream Nutch. It should be fixed upstream anyway. I'll also comment on PR #44.

@lfoppiano lfoppiano marked this pull request as ready for review February 26, 2026 13:40

@sebastian-nagel sebastian-nagel left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @lfoppiano, thanks! The fix to change the Content-Type in the WARC header looks good.

See the inline comments about the unit test code.

Comment thread src/test/org/commoncrawl/util/TestWarcWriter.java Outdated
Comment thread build.xml Outdated
@lfoppiano lfoppiano force-pushed the bugfix/revisit-content-type branch from 0f0c4b3 to b0110c9 Compare March 23, 2026 14:24
@lfoppiano

Copy link
Copy Markdown
Author

@sebastian-nagel I believe now the test works fine with the data from a real case. 🌮 🎉
I think there are only two item that I did not find:

String payloadDigest = "sha1:abc123";
String blockDigest = "sha1:def456";

@sebastian-nagel

Copy link
Copy Markdown

The digests are calculated one level higher in the WarcRecordWriter and then passed to the WarcWriter. No need to test it as part of this PR.

@sebastian-nagel sebastian-nagel left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @lfoppiano! Looks good to me.

Please merge!

@lfoppiano lfoppiano merged commit d0c4d97 into cc Mar 28, 2026
1 check passed
@lfoppiano lfoppiano deleted the bugfix/revisit-content-type branch March 28, 2026 19:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants